arXiv:1305.3668v1 [cs.SI] 16 May 2013


Mining  for  Geographically  Disperse  Communities  in  Social 
Networks  by  Leveraging  Distance  Modularity 


Paulo  Shakarian 
Network  Science  Center  and 
Dept. of Electrical Engineering
and  Computer  Science 
U.S.  Military  Academy 
West  Point,  NY  10996 
paulo@shakarian.net 


Patrick  Roos 
Dept. of Computer Science
University  of  Maryland 
College  Park,  MD  20721 

roos@cs.umd.edu 


Devon  Callahan, 

Cory  Kirk 

Network  Science  Center  and 
Dept. of Electrical Engineering
and  Computer  Science 
U.S.  Military  Academy 
West  Point,  NY  10996 
devon.callahan@usma.edu 
cory.kirk@usma.edu 


ABSTRACT

Social networks where the actors occupy geospatial locations
are prevalent in military, intelligence, and policing operations
such as counter-terrorism, counter-insurgency, and combating
organized crime. These networks are often derived from a variety
of intelligence sources. The discovery of communities that are
geographically disperse stems from the requirement to identify
higher-level organizational structures, such as a logistics group
that provides support to various geographically disperse terrorist
cells. We apply a variant of Newman-Girvan modularity, known as
distance modularity, to this problem. To address the problem of
finding geographically disperse communities, we modify the
well-known Louvain algorithm to find partitions of networks that
provide near-optimal solutions to this quantity. We apply this
algorithm to numerous samples from two real-world social networks
and a terrorism network data set whose nodes have associated
geospatial locations. Our experiments show this to be an effective
approach and highlight various practical considerations when
applying the algorithm to distance modularity maximization. Several
military, intelligence, and law-enforcement organizations are
working with us to further test and field software for this
emerging application.


1. INTRODUCTION

In recent years, fueled by the connectivity of our social
world  and  technological  advances  that  allow  for  effortless 
collection  of  connectivity  data,  much  effort  has  been  invested 
in  developing  algorithms  for  the  detection  of  communities  in 
networks  (e.g.  11  18}  [IT]  [t[  [3j  19  9}  [5] ) .  The  detection  of 
communities  -  subsets  of  nodes  that  are  highly  connected  in 
globally  sparser  networks  -  provides  important  insights  into 
the  organization  of  networks  and  related  hidden  information 
of  social  networks  [Tl  . 

In  many  application  domains,  apart  from  the  social  net¬ 
work  information  provided  by  connectivity  data,  geospa¬ 
tial  information  is  available  as  well,  and  community  detec¬ 
tion  algorithms  can  be  improved  by  leveraging  such  spa¬ 
tial  information.  Social  networks  where  the  actors  occupy 
geospatial  locations  are  prevalent  in  military,  intelligence, 
and  policing  operations  such  as  counter-terrorism,  counter¬ 
insurgency,  and  combating  organized  crime.  These  networks 
are  often  derived  from  a  variety  of  intelligence  sources.  Com¬ 
munity  detection  algorithms  that  specifically  detect  geograph¬ 
ically  dispersed  communities  are  of  interest  in  such  applica¬ 
tion  domains  to  identify  higher-level  organizational  struc¬ 
tures,  such  as  a  logistics  group  that  provides  support  to 
various  geographically  disperse  terrorist  cells.  Such  com¬ 
munities  may  be  less  obvious  in  solely  the  available  social 
network  data.  Hence,  in  order  to  find  geographically  dis¬ 
persed  communities,  there  exists  a  need  for  community  de¬ 
tection  algorithms  that  are  optimized  considering  geospatial 
information  in  addition  to  social  network  information,  and 
we  address  this  need  in  this  paper. 

Blondel et al. [3] have developed a heuristic method known
as the Louvain algorithm that partitions a social network
into communities while optimizing the Newman-Girvan modularity
of the partition. Newman-Girvan modularity is a common
performance measure in community detection algorithms that
gives a measure of how densely the detected communities of the
partition are connected relative to connections between these
communities [18]. More specifically, the modularity measure is
the “fraction of edges within communities in the observed network
minus the expected value of that fraction in a null model, which
serves as a reference and should characterize some features of
the observed network” [14].

In this paper, we use a variant of Newman-Girvan modularity
with the Louvain algorithm to address the problem of mining
for geographically dispersed communities in application domains
where geospatial information is pertinent. Instead of the
original null model used in Newman-Girvan modularity, we
leverage a null model introduced by Liu et al. [14]. The use of
this model results in the distance modularity measure of
community structure.

We  test  the  algorithm  on  two  real-world  location-based  so¬ 
cial  networks  and  a  network  from  a  transnational  terrorism 
data  set,  the  nodes  of  which  have  associated  geospatial  loca¬ 
tions.  Our  experiments  show  that  this  approach  is  effective 
at  finding  partitions  of  networks  that  provide  near-optimal 
solutions  to  distance  modularity.  We  also  highlight  various 
practical  considerations  when  applying  the  algorithm  with 
these  definitions  of  modularity.  By  testing  the  algorithm 
on a social network that is significantly larger (ca. 2100
nodes) than the test networks commonly used in the literature
on community detection algorithms (typically ≲ 600
nodes), we also better demonstrate scalability. Further, our
results  on  the  transnational  terrorism  network  provide  some 
insight  into  how  our  approach  will  function  on  the  often  clas¬ 
sified  datasets  of  our  target  application.  Currently,  we  are 
working  with  several  organizations  in  the  U.S.  Department 
of  Defense  and  the  law  enforcement  communities  to  further 
study  and  transition  this  technology. 

Next,  in  Section  2,  we  cover  some  technical  preliminar¬ 
ies,  including  definitions  of  modularity.  Section  3  describes 
the  Louvain  algorithm  and  our  modifications  to  it  to  opti¬ 
mize  for  distance  modularity.  Section  4  describes  our  exper¬ 
iments  and  results  on  various  data  sets  and  an  application 
to  transnational  terrorism.  We  review  and  place  our  work 
within  related  work  in  Section  5,  and  finally  we  conclude  in 
Section  6. 


2.  TECHNICAL  PRELIMINARIES 

Throughout this paper, we shall model a network as an
undirected graph $G = (V, E)$ where $V$ is a set of nodes
and $E$ is a set of relationships among nodes. We use $n, m$
to represent the cardinalities of $V, E$ respectively. As the
graph is undirected, we shall assume that $(v_i, v_j) \in E$
implies $(v_j, v_i) \in E$. We also assume that each edge
$(v_i, v_j)$ has an associated weight $w_{ij}$ (again
$\forall i,j,\ w_{ij} = w_{ji}$). For a given node $v_i \in V$,
$\eta_i = \{v_j \mid (v_i, v_j) \in E \vee (v_j, v_i) \in E\}$ and
$k_i = |\eta_i|$.

We shall use the notation $C = \{c_1, \ldots, c_q\}$ to denote a
partition over set $V$ where each $c_i \in C$ is a subset of $V$,
for any distinct $c_i, c_j \in C$, $c_i \cap c_j = \emptyset$, and
$\bigcup_i c_i = V$.

For a given partition $C$, the modularity $M(C)$ is a number
in $[-1, 1]$. The modularity of a network partition can
be used to measure the quality of its community structure.
Originally introduced by Newman and Girvan [18], this metric
measures the density of edges within communities compared
to the density of edges between communities. A formal
definition of this modularity (henceforth referred to as NG
modularity) for an undirected network is

Definition 2.1 (NG modularity [18]). Given partition
$C = \{c_1, \ldots, c_q\}$, NG modularity,

$$M(C) = \frac{1}{2m} \sum_{c \in C} \sum_{i,j \in c} \left( w_{ij} - P_{ij} \right)$$

where $P_{ij} = \frac{k_i k_j}{2m}$.


Here, the null model used as a reference for comparison to
a given partition assumes edges are rewired randomly, while
the degree sequence of the input network is preserved, hence

$$P_{ij} = \frac{k_i k_j}{2m}.$$

Recently, a measure for modularity that accounts for distance,
as well as network topology, was introduced by Liu et
al. [14]. Their modularity, henceforth referred to as distance
modularity, is defined as follows:


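Definition 2.2 (Distance Modularity [14]). Given
partition $C = \{c_1, \ldots, c_q\}$, distance modularity,

$$M_{dist}(C) = \frac{1}{2m} \sum_{c \in C} \sum_{i,j \in c} \left( w_{ij} - P_{ij} \right)$$

where $P_{ij} = \frac{\hat{P}_{ij} + \hat{P}_{ji}}{2}$,
$\hat{P}_{ij} = \frac{k_i k_j f(d(v_i, v_j))}{\sum_{v_q \in V} k_q f(d(v_q, v_i))}$,
and $f : \Re^+ \to (0, 1]$ is the distance-decay function.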


The basic idea behind this distance modularity is that
each node exerts a force on other nodes by generating a
field, and the potential of the field at any point decreases
with distance from the field source (the node generating the
field), depending on the distance-decay function [13, 14].
The null model that serves as a reference for comparison
here then assumes that nodes which are closer according to
the distance function are more likely to be connected. In this
paper we shall assume the existence of a distance function
$d : V \times V \to \Re^+$ that meets the normal axioms:
$d(v_i, v_i) = 0$, $d(v_i, v_j) = d(v_j, v_i)$, and
$d(v_i, v_j) \le d(v_i, v_q) + d(v_q, v_j)$.
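As an illustration of Definition 2.2, the following is a minimal sketch of how the null model and the distance modularity of a given partition could be computed. It is not the implementation used in our experiments: the function names `null_model` and `distance_modularity` are hypothetical, edges are assumed to be unweighted and listed once each, $k_i$ is taken to be the node degree, and planar Euclidean distance stands in for the distance function $d$. The decay function $f$ is passed in as a parameter.

```python
import math
from itertools import product


def null_model(nodes, degree, dist, f):
    """P[i, j] = (Phat_ij + Phat_ji) / 2, following Definition 2.2."""
    # Denominator of Phat_ij: sum over every v_q of k_q * f(d(v_q, v_i)).
    denom = {i: sum(degree[q] * f(dist(q, i)) for q in nodes) for i in nodes}
    phat = {
        (i, j): degree[i] * degree[j] * f(dist(i, j)) / denom[i]
        for i, j in product(nodes, repeat=2)
    }
    return {(i, j): 0.5 * (phat[i, j] + phat[j, i]) for (i, j) in phat}


def distance_modularity(edges, partition, pos, f):
    """Distance modularity of `partition` for an unweighted, undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once (assumption).
    partition: list of sets of nodes covering every key of `pos`.
    pos: node -> (x, y) planar coordinates (assumption).
    f: distance-decay function mapping a distance to a value in (0, 1].
    """
    nodes = sorted(pos)
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    degree = {v: len(adj[v]) for v in nodes}
    m = len(edges)

    def dist(a, b):
        return math.dist(pos[a], pos[b])  # Euclidean distance (assumption)

    P = null_model(nodes, degree, dist, f)
    total = 0.0
    for community in partition:
        for i, j in product(community, repeat=2):
            w_ij = 1.0 if j in adj[i] else 0.0
            total += w_ij - P[i, j]
    return total / (2 * m)
```

Passing a constant decay function (`lambda x: 1.0`) makes the sketch return the NG modularity of the partition, which gives a simple sanity check.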

Previously, it has been proven that modularity maximization
is NP-hard [2]. Clearly, setting $\forall x,\ f(x) = 1$, distance
modularity reduces to NG modularity, since the denominator of the
null model then becomes the sum of all node degrees, which equals
$2m$. As a direct consequence of this observation, finding a
partition that optimizes distance modularity is also NP-hard.


Theorem 2.1. Given graph $G = (V, E)$ and distance function
$d : V \times V \to \Re^+$, finding a partition $C$ of $V$ that
maximizes $M_{dist}$ is NP-hard.


Throughout this paper, we will use an exponential
distance-decay model [14, 5, 16, 20], defined as follows:

$$f(x) = e^{-(x/\sigma)^2}$$

where $\sigma$ is a parameter in the interval $(0, \infty)$ and $e$
is the base of the natural logarithm. One way to interpret $\sigma$
is that it is the distance at which the force exerted by a point
is reduced to a fraction $1/e$ (roughly 0.37) of its value at the
source. We note that in the limit as $\sigma$ approaches infinity,
distance modularity reduces to NG modularity. In Section 4, we
test a variety of settings for $\sigma$. Learning parameters such
as $\sigma$ has previously been explored in various geospatial
applications - see [16, 20] for examples.
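The behaviour of the decay function can be checked numerically with a short sketch. The function name `decay` is hypothetical, and the squared exponent follows our reading of the formula above; distances and the parameter are assumed to share the same unit.

```python
import math


def decay(x, sigma):
    """Distance decay f(x) = exp(-(x / sigma)^2) for x >= 0 and sigma > 0."""
    return math.exp(-((x / sigma) ** 2))


# At x = sigma the value drops to 1/e of the value at the source (x = 0).
at_sigma = decay(100.0, 100.0)

# For very large sigma the decay is close to 1 at any practical distance,
# which is the regime in which distance modularity approaches NG modularity.
nearly_flat = decay(500.0, 1e9)
```

For distances much larger than the parameter, the floating-point result may underflow to 0.0, so an implementation that divides by sums of decay values may want to guard against a zero denominator.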


3.  APPROACH 

This section describes the approach we use to mine for
geographically dispersed communities in networks. Although
modularity maximization is NP-hard, a variety of practical
approximation routines have been proposed [10, 17, 3]
that experimentally have produced near-optimal partitions.
In this paper, we employ the Louvain heuristic algorithm
of Blondel et al. [3], only instead of using it to maximize
NG modularity, we use it to maximize distance modularity.
In order to use the Louvain algorithm to maximize distance
modularity, we must also modify some of its steps. We summarize
the Louvain algorithm briefly next (for more details
on this algorithm, see [3]) and describe our modifications
and practical considerations when employing this heuristic
algorithm to optimize distance modularity.

3.1  Heuristic  Algorithm 

The Louvain algorithm is an iterated, hierarchical process
in which two phases are applied repeatedly until no further
gain in modularity is obtained. In the first phase, each node
$v_i \in V$ of the given network is assigned to a community $c$,
creating an initial partition. In [3], the singleton partition was
used. Then, for each $v_i \in V$, the gains in modularity that
would result from moving $v_i$ into the community of each of
its neighbors $v_j \in \eta_i$ are calculated, and $v_i$ is removed
and placed into the community for which the maximum gain in
modularity is attained (unless no positive gain in modularity
is possible). This sub-process is repeated sequentially for
each $v_i \in V$ until no individual move will result in a gain
in modularity, marking the end of the first phase and giving
a partition $C$. In the second phase, a new network is built
by using each $c_i \in C$ as a node in the new network; we call
these nodes meta-nodes. Weights on the edges between any
two meta-nodes in the new network are assigned to be the
sum of the weights of the edges between nodes in the two
communities corresponding to the meta-nodes. In this step,
self-loops are created for each meta-node in the new network
from the links between nodes of the community corresponding
to that meta-node. After this phase is complete, the
two phases are reapplied iteratively until there are no more
changes.

The efficiency of the Louvain algorithm relies on an easy
re-calculation of modularity in the first phase of the algorithm.
When computing gains in modularity in phase one
of the algorithm, after removing any node $v_i$ from its community,
the overall increase in modularity (regardless of whether it is
distance or NG modularity) if it is placed into community $c$ is
proportional to:

$$\sum_{j \in c} \left( w_{ij} - P_{ij} \right)$$

The only difference for distance modularity is that $P_{ij}$ is
defined as per Definition 2.2 instead of Definition 2.1. In
terms of time complexity, the first phase of the algorithm is
$O(n^2)$, since for every node in the network, distance modularity
must be computed according to Definition 2.2, which
is $O(n)$ in the denominator of $P_{ij}$. The second phase is again
$O(n)$. Both costs are multiplied by a constant that results
from the number of iterations needed to run to completion.
We note that the input sizes decrease drastically with each
iteration, since communities are iteratively collapsed into
nodes. Hence, the proposition on time complexity follows:

Proposition  3.1.  The  time  complexity  of  the  Louvain  al¬ 
gorithm,  optimizing  for  distance  modularity,  is  quadratic  in 
terms  of  the  number  of  nodes  n  of  the  input  network. 
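The local move of the first phase can be sketched as follows. This is a simplified illustration rather than our implementation: the names `move_gain` and `best_community` are hypothetical, the null model values are assumed to be precomputed in a dictionary `P` keyed by node pairs, and the weights `w` are assumed to be stored under the key `(i, j)` for the node `i` being moved.

```python
def move_gain(i, community, w, P):
    """Sum over j in `community` (excluding i) of w_ij - P_ij.

    The change in modularity from placing node i into `community` is
    proportional to this quantity.
    """
    return sum(w.get((i, j), 0.0) - P[i, j] for j in community if j != i)


def best_community(i, communities, node_to_comm, neighbors, w, P):
    """Return the label of the community that node i should join.

    Only the communities of i's neighbors are considered; node i stays
    where it is unless another community offers a strictly larger gain.
    """
    current = node_to_comm[i]
    # Gain of remaining in the current community (measured without i itself).
    best, best_gain = current, move_gain(i, communities[current], w, P)
    for c in {node_to_comm[j] for j in neighbors[i]} - {current}:
        gain = move_gain(i, communities[c], w, P)
        if gain > best_gain:
            best, best_gain = c, gain
    return best
```

Because `P` is looked up rather than recomputed, the same routine serves both NG and distance modularity; only the contents of `P` differ.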

3.2  Practical  Considerations 

Apart from the main modification to use distance modularity
instead of NG modularity, there are two steps of the
original Louvain algorithm that we modify when optimizing
for distance modularity. First, we must decide on an initial
partition to use. Blondel et al. [3] use the singleton partition.
However, we have found that using the Louvain partition,
resulting from a normal run of the Louvain algorithm
optimizing for NG modularity, provided better results at the
expense of some runtime. Second, since each node has an
associated geospatial value, a geospatial value must be
assigned to the meta-nodes of the new networks being built.
Here we use the centroid of the communities that correspond
to the meta-nodes. Throughout the remainder of this
paper, we shall refer to the described modified version of the
Louvain algorithm (for maximizing distance modularity) as
the Louvain-D algorithm. The implications of these considerations
are discussed in more detail in our experimental
results.


Table 1: Brightkite Sample Data

        Nodes     Edges
Max     331       2801
Min     300       787
Avg     311.25    1560.40
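The second modification of Section 3.2, assigning a location to each meta-node, can be sketched together with the edge aggregation of the second phase. This is an illustrative sketch only: the function name `aggregate` is hypothetical, coordinates are assumed to be planar, and the centroid is taken as the unweighted mean of member positions, which is one of several possible readings of "centroid".

```python
from collections import defaultdict


def aggregate(partition, pos, w):
    """Collapse communities into meta-nodes (phase two).

    partition: list of sets of nodes.
    pos: node -> (x, y) planar coordinates (assumption).
    w: {(u, v): weight}, each undirected edge listed once (assumption).
    Returns (meta_pos, meta_w), where meta-node labels are indices into
    `partition` and meta_w keys are ordered pairs (a, b) with a <= b.
    """
    comm_of = {v: idx for idx, c in enumerate(partition) for v in c}

    # Location of each meta-node: unweighted mean of its members' positions.
    meta_pos = {}
    for idx, c in enumerate(partition):
        meta_pos[idx] = (
            sum(pos[v][0] for v in c) / len(c),
            sum(pos[v][1] for v in c) / len(c),
        )

    # Edge weights between meta-nodes: sum of the weights of original edges.
    meta_w = defaultdict(float)
    for (u, v), weight in w.items():
        a, b = sorted((comm_of[u], comm_of[v]))
        meta_w[a, b] += weight  # a == b yields the self-loop of a meta-node
    return meta_pos, dict(meta_w)
```

For latitude and longitude data spanning large areas, a planar mean is only an approximation, and a spherical centroid may be more appropriate.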

4.  EXPERIMENTAL  RESULTS 

For our experiments, we used information extracted from
the Gowalla and Brightkite location-based online social
networking sites [6].

We built our implementation in Python 2.6 on top of the
NetworkX library¹, leveraging code from Thomas Aynaud’s
implementation of the Louvain algorithm². Our implementation
took approximately 1000 lines of code. The experiments
were run on a computer equipped with an Intel X5677
Xeon processor operating at 3.46 GHz with a 12 MB cache,
running Red Hat Enterprise Linux version 6.1 and equipped
with 70 GB of physical memory. All statistics presented in
this section were calculated using R 2.13.1.

4.1 Distance Modularity Evaluation

In our first set of tests, we iteratively selected nodes and
their neighbors from the Brightkite network dataset provided
by the authors of [6] to produce 20 small samples (of
at least 300 nodes each). Each sample originated with a
randomly selected node from the network, and we iteratively
added neighbors of the selected node(s) to the sample until a
minimum desired size was achieved. Node and edge counts
for these small networks are listed in Table 1.
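The sampling procedure can be sketched as a breadth-first expansion from a random seed node. The function name `snowball_sample` is hypothetical, and the exact stopping rule is an assumption: here all unseen neighbors of one node are added at a time, and expansion stops once the sample reaches the minimum size, so the final sample may slightly exceed it.

```python
import random
from collections import deque


def snowball_sample(adj, min_size, seed=None):
    """Grow a node sample outward from a randomly chosen start node.

    adj: node -> iterable of neighboring nodes (undirected graph).
    min_size: stop expanding once the sample has at least this many nodes.
    seed: optional seed for reproducible start-node selection.
    """
    rng = random.Random(seed)
    start = rng.choice(sorted(adj))
    sample = {start}
    queue = deque([start])
    while queue and len(sample) < min_size:
        v = queue.popleft()
        # Add every neighbor of v that is not yet part of the sample.
        for u in sorted(adj[v]):
            if u not in sample:
                sample.add(u)
                queue.append(u)
    return sample
```

The sample network is then the subgraph induced by the returned node set. If the start node lies in a connected component smaller than `min_size`, the sketch returns that whole component.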

On our 20 samples extracted from the Brightkite dataset,
we considered the straight-line distance between nodes in
kilometers. Hence, in calculating geomodularity (distance
modularity), we ran experiments with
$\sigma \in \{50, 100, 150, \ldots, 500\}$. For each dataset and
each value of $\sigma$, we compared the distance modularity
returned by three approaches: the Louvain algorithm (which
does not consider any geospatial information), the Louvain-D
algorithm using singleton nodes as the initial partition,
and the Louvain-D algorithm using the result of the Louvain
algorithm as the initial partition.
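One way to obtain distances in kilometers from latitude and longitude pairs is the haversine (great-circle) formula, sketched below together with the parameter grid. This is an assumption made for illustration: the text above does not specify how the straight-line distance was computed, and the names `haversine_km` and `SIGMAS_KM` and the mean Earth radius of 6371 km are ours.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the valid range.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


# Decay parameter settings (in kilometers): 50, 100, ..., 500.
SIGMAS_KM = list(range(50, 501, 50))
```

Over the distances typical of a regional sample, great-circle and straight-line (chord) distances differ only slightly, but the choice should be made consistently for all node pairs.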

Both variants of the Louvain-D algorithm returned a partition
with a greater average geomodularity for each value
of $\sigma$ than the partition returned by the Louvain algorithm
(see Figure 1). This aligns well with previous results
showing that space can affect community structure in ways not
accounted for in the network topology. However, we noticed
that the percentage increase in modularity decreases with
$\sigma$ (see Figure 2). This relationship also makes sense as
distance modularity reduces to NG modularity (which the
Louvain algorithm is designed to optimize) as $\sigma$ approaches
infinity.

¹ http://networkx.github.com/
² http://perso.crans.org/aynaud/communities/

While both variants of the Louvain-D algorithm outperformed the
Louvain algorithm in terms of geomodularity, Louvain-D generally
returned higher-quality partitions if it was initialized
with the Louvain partition instead of the partition of singleton
nodes. Further, when we used the Louvain partition
to initialize the Louvain-D algorithm, we never obtained a
partition with a lower geomodularity than the Louvain algorithm.
With the singleton partition, on the other hand, the
Louvain-D algorithm was occasionally outperformed by the Louvain
algorithm - particularly for the higher values of $\sigma$.


Figure 1: $\sigma$ (in kilometers) vs. (average) distance
modularity for the partitions returned by the
Louvain-D and Louvain (baseline) algorithms.


However, the improvement in quality when using the
Louvain partition as a starting point comes at the expense of
runtime. While the time to calculate the Louvain partition
was negligible (normally under 1 second in the Brightkite
tests), using it as a starting point appears to cause the
Louvain-D algorithm to take longer to reach convergence,
resulting in a runtime nearly double that observed when the
singleton partition is used initially (see Figure 3).

An analysis of variance (ANOVA) reveals that there is a
significant difference in geomodularity of the partitions
returned by the three approaches on the Brightkite dataset
(p-value $2.2 \cdot 10^{-16}$). Additionally, pairwise analysis
conducted using Tukey’s Honest Significant Difference (HSD)
test indicates that both instances of the Louvain-D algorithm
provided results that differed significantly from the
Louvain algorithm and from each other with a probability
approaching 1.0 (95% confidence). The differences among
runtimes were also significant (ANOVA p-value
less than $2.2 \cdot 10^{-16}$) and pairwise different by the HSD
test with a probability approaching 1.0 amongst all comparisons
(95% confidence).


[Figure 2 plots: x-axis $\sigma$ (km); series: All Trials, Average]

Figure 2: $\sigma$ (in kilometers) vs. percent improvement
in geomodularity (for the partition returned
by the Louvain-D algorithm) when compared to the
distance modularity for the partition returned by
the Louvain (baseline) algorithm. Panel A shows
this relationship when the Louvain-D algorithm initially
uses the singleton partition, while panel B shows this
relationship when the Louvain-D algorithm initially
uses the Louvain partition.


As an example of the type of result returned by our approach,
we have included Figure 4, which illustrates the differences
between a community returned by our approach vs.
the standard Louvain algorithm. The left panel shows a
group of individuals near the San Diego area that the Louvain
algorithm identified as being in the same community.
Likely, in this case, there is a strong correlation between
geographic distance and connection in the social network. The
right panel, by contrast, shows that the same individuals are
placed in multiple, different communities by the Louvain-D
algorithm. Since relatively high-degree individuals that are
geographically near each other have a higher probability of
connection in the null model, it becomes more likely for the
Louvain-D algorithm to place them in different communities.

Table 2: Gowalla Sample Data

Sample No.    Nodes    Edges
1             301      416
2             602      1550
3             876      12373
4             1201     2680
5             1501     3854
6             1801     4887
7             2101     6445

Figure 3: $\sigma$ (in kilometers) vs. (average) runtime of
the Louvain-D algorithm (using both singleton and
Louvain partition initially).


Figure 4: Left: Communities identified using the
Louvain algorithm. Right: Communities found using
Louvain-D ($\sigma = 150$).

4.2  Tests  on  Larger  Samples 

In our second set of tests, we iteratively selected nodes and
their neighbors from the Gowalla network dataset [6] to produce
seven samples ranging in size from 301 to 2101 nodes
each. Samples were collected in the same manner as with the
Brightkite samples previously described. Distances between
nodes are computed in kilometers. Node and edge counts for
these networks are listed in Table 2. Note that our tests
examine networks significantly larger than those considered
in related work where communities are determined based on
geography and network topology (100 nodes in [5] and 571
nodes in [9]).

We evaluated the Louvain-D algorithm on these samples
with $\sigma = 100$, initially using the Louvain partition, and
compared the distance modularity of the resulting partition
to that of the partition returned by the standard Louvain
algorithm. With all seven samples, the Louvain-D algorithm
outperformed the standard approach. Improvement ranged
from 2.8% to 14.2%. The results are depicted in Figure 5.

Figure 5: Distance modularity of the partition found
using the Louvain (baseline) and Louvain-D algorithms
for the Gowalla network samples (see Table 2).


We also studied the runtime of the Louvain-D algorithm
and compared it to the size of the samples. As per Proposition
3.1, we expected a quadratic relationship. We verified
this relationship in our experiment ($R^2 = 0.9973$). These
results are depicted in Figure 6. We note that while considering
a network of 2101 nodes required just under two days
of computer time, which is acceptable for our applications,
further scaling will lead to prohibitively long runtimes. For
example, scaling to $10^4$ nodes would require approximately
three months of runtime based on our regression analysis.
Further scalability is an important direction for future work.

4.3  Application:  Transnational  Terrorism 

In this section we use the open-source derived terrorist
network of Medina and Hepner [15] as a proxy for the (often
classified) networks that will be used by this software in
practice. The network consists of 358 geolocated individuals
in a transnational terrorist organization (660 unweighted
edges). A diagram of the network is shown in Figure 7, while
the locations of the individuals are shown in Figure 8.

We ran the Louvain-D algorithm (initially using the Louvain
partition) with $\sigma \in \{50, 100, 150, \ldots, 500\}$ and
compared the distance modularity of the resulting partition to
that returned by the standard Louvain algorithm. The Louvain-D
algorithm consistently outperformed the baseline approach
(Figure 9), with the percent improvement ranging from 8.2% to
9.8%. The results are consistent with the other trials, where
the distance modularity of the partition produced by the
Louvain-D algorithm monotonically decreases with $\sigma$, slowly
approaching the distance modularity of the baseline approach.

[Figure 6 plot legend: Louvain-D Algorithm; Quadratic Fit, $R^2 = 0.99734$]

Figure 6: Network size (in nodes) vs. runtime (in
hours) for the Gowalla network samples. Note the
strong quadratic fit.


Figure 7: Network relationships in the transnational
terrorist dataset of [15].


To better understand how a practitioner would use our approach
for analysis, we considered the problem of identifying
a single, important geographically disperse community. We
can identify such a group of individuals by determining the
quality of a given community. We can derive such a measure
directly from the definition of modularity. For a given
community $c \in C$, we can determine the quality as follows:

$$M_c = \sum_{i,j \in c} \left( w_{ij} - P_{ij} \right) \qquad (1)$$

We ranked all the communities for the transnational
terrorist organization (over all settings of $\sigma$ we considered)
and took the top one. We show the visualization of the
network and geolocations of the individuals in Figures 11
and 10, respectively. Note that the members of the identified
community span three continents. Identifying communities such
as these can provide intelligence analysts insight into how various
geographically-disperse terrorist cells interact with higher-level
organizations.
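The ranking step can be sketched as follows. The names `community_quality` and `top_community` are hypothetical, the null model values are assumed to be precomputed per parameter setting, and the weight dictionary is assumed to hold both orientations of every edge.

```python
def community_quality(community, w, P):
    """Quality of one community: sum over i, j in c of w_ij - P_ij.

    w: {(i, j): weight}, containing both (i, j) and (j, i) for each edge.
    P: {(i, j): null-model value} for every ordered pair of nodes.
    """
    return sum(
        w.get((i, j), 0.0) - P[i, j]
        for i in community
        for j in community
    )


def top_community(candidates, w):
    """Return the highest-quality candidate.

    candidates: iterable of (sigma, community, P) triples, one per community
    found under each parameter setting; P is the null model for that sigma.
    """
    return max(candidates, key=lambda t: community_quality(t[1], w, t[2]))
```

Since the null model depends on the decay parameter, each candidate is scored against the null model of the run that produced it.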

5.  RELATED  WORK 

Figure 8: Geographic locations of the individuals in
the transnational terrorist dataset of [15].


Figure 9: Comparison of distance modularity between
Louvain and Louvain-D algorithms for the
transnational terrorism dataset.


Figure 10: Geolocations of the individuals in the
top-ranked community from the transnational terrorist
network.


Figure 11: Visualization of the network topology of
the community shown in Figure 10.


The use of modularity maximization for community finding
was first introduced in [18], which also described how
to find partitions that could nearly maximize this quantity.
An exact method for addressing this optimization problem
was introduced in [2]. However, this method was based on
integer programming and for many problem instances may
take an exponential amount of time to complete. We note
that an easy modification of that program can be
used to address the problem of this paper, as the quantity
$P_{ij}$ can be computed in a pre-processing step and treated as a
constant in the integer program formulation. Note that the
time to complete such a step would be easily dominated by
the overall runtime to even approximate a solution in such a
method. In the same paper, modularity maximization was
also shown to be NP-hard, which precludes efficient exact
approaches under current theoretical assumptions. In [3] the
Louvain algorithm is introduced, which is shown to provide
partitions that nearly maximize modularity and can scale
to very large networks. A modification of the Louvain
algorithm is what we leveraged in this paper.

Modularity was extended to consider geospatial relationships
using a distance-decay model in [14] with the introduction
of distance modularity, which we use in this paper.
Because their approach modifies the null model to increase the
expected number of edges between close nodes, it will tend
to find communities that are more geographically disperse -
hence meeting the requirement of our presented application.
Our work extends their theory by providing an algorithm to
find approximately optimal partitions with respect to distance
modularity, experimental results, and a description of practical
considerations - none of which were included in [14], which only
introduces the concept of distance modularity and describes
the mathematical properties of their alternative null
model.

The recent work of [5] introduces “spatial modularity”,
which also uses a distance-decay function in the null model -
though somewhat different from that introduced in [14]. They
study the difference among partitions created by attempting
to optimize both standard modularity and their alternate
definition on a series of small simulated networks whose
edges are formed based on varying degrees of correlation
between space and node similarity (determined by randomly
assigned attributes). The results of that paper have also
inspired this work, as they indicate that considering geospatial
relationships in the null model often yields different community
structure than the original definition of modularity
introduced by [18]. However, unlike this paper, the
work of [5] only studies simulated networks (this paper only
looks at real-world networks). The networks of this paper
are an order of magnitude larger, as [5] only considers networks
of 100 nodes. Further, [5] does not describe any practical
concerns in their approach that must be considered
when creating a real-world system.

Another important result on community finding in geospatial
networks is that of [9], where the authors also modify
modularity. In that work, however, the authors use a null model
based on an empirically observed probability distribution of
edge existence as a function of distance. Their optimization
approach was tested on a network of Belgian communes of phone
users and was shown to accurately identify linguistic
communities. Unlike this paper and the work of [14] and [5],
because their null model is based on an empirically determined
probability distribution, it will not necessarily ensure
geographically disperse communities, which is our target
application. Further, the work of [9] does not describe
practical considerations, and its experimental evaluation is
restricted to the Belgian phone network data consisting of 571
nodes.
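
As a rough illustration of what an empirically observed distribution of edge existence by distance might look like, the sketch below bins all node pairs by Euclidean distance and reports the fraction of pairs in each bin that are linked. This is a simplification written for this discussion, not the estimator of [9], which may weight node pairs differently; the function name, the planar coordinates, and the `bin_width` parameter are assumptions.

```python
import math
from itertools import combinations


def empirical_edge_probability(edges, coords, bin_width):
    """Fraction of node pairs in each distance bin that share an edge.

    edges:     list of (u, v) pairs of an undirected simple graph
    coords:    dict node -> (x, y)
    bin_width: width of each distance bin (hypothetical parameter)
    """
    linked = {frozenset(e) for e in edges}
    pairs_in_bin = {}
    edges_in_bin = {}
    for u, v in combinations(sorted(coords), 2):
        (x1, y1), (x2, y2) = coords[u], coords[v]
        b = int(math.hypot(x1 - x2, y1 - y2) // bin_width)
        pairs_in_bin[b] = pairs_in_bin.get(b, 0) + 1
        if frozenset((u, v)) in linked:
            edges_in_bin[b] = edges_in_bin.get(b, 0) + 1
    # Bins that contain pairs but no edges get probability 0.0.
    return {b: edges_in_bin.get(b, 0) / n for b, n in pairs_in_bin.items()}
```

A null model built from such a curve reproduces whatever distance dependence the data already exhibit, which is why it does not by itself favor disperse communities.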

In addition to the aforementioned approaches, community
detection in networks has also been explored in other ways that
could potentially prove applicable to geospatial applications,
though to our knowledge no such application has been presented
in the literature. For instance, the work of [21] identifies
communities based on both network topology and content analysis.
Further, there are methods for community detection on networks
other than modularity maximization (that do not consider spatial
interactions). Leveraging one of these other approaches is an
important direction for future work. See [10] for a
comprehensive survey.

There has been other recent work where geospatial networks have
been explored with respect to problems other than community
finding. One such effort discusses link prediction and shows
that considering geography can improve results for this problem.
The work of [1] looks at identifying the location of users on
Twitter using network topology. Further, there have also been
empirical studies on social networks with a spatial component,
such as [2]. Along such lines, the mobility of users in a
location-based social network is explored in [6], [8]. More
domain-specific empirical studies related to this work are also
prevalent in the literature. Pertinent to our application are
studies on terrorist networks [15] and criminal co-offender
networks [19].


6.  CONCLUSION 

In this paper, we have presented a modified Louvain algorithm to
find partitions of networks that provide near-optimal solutions
for both nearness and distance modularity, providing a way to
leverage spatial information in addition to network connection
topology when mining networks for communities. We have evaluated
this algorithm on two real-world location-based social networks,
as well as a real-world transnational terrorism network data
set. Our results have shown that the Louvain algorithm modified
to optimize distance modularity is an effective approach to the
problem of finding geographically disperse communities, finding
near-optimal solutions to distance modularity. Our experiments
have also shown that using the Louvain partition instead of a
singleton partition in the initial partitioning step of the
algorithm generally provides improved final partitions in terms
of distance modularity. We have demonstrated the scalability of
the algorithm by considering networks of more than 2000 nodes, a
number significantly greater than the network sizes typically
considered in the related literature. Finally, particularly
through our experiments applying the algorithm to a real-world
transnational terrorism network data set, we have found the
presented approach to be useful for finding geographically
disperse communities at a time scale that is practical in the
application domain.
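
The role of the initial partition can be illustrated with a deliberately naive local-moving pass, the first phase of Louvain-style methods. The sketch below is not the modified algorithm evaluated in this paper: it scores partitions with standard Newman-Girvan modularity, recomputes the score from scratch for every candidate move, and omits the aggregation phase. The function names and the `initial` and `quality` parameters are hypothetical; a distance-based quality function with the same signature could be substituted for `modularity`.

```python
def modularity(edges, partition):
    """Standard modularity of an unweighted simple graph without self-loops."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    total = {}   # summed degree of each community
    inside = {}  # number of edges with both ends in the community
    for v, k in degree.items():
        total[partition[v]] = total.get(partition[v], 0) + k
    for u, v in edges:
        if partition[u] == partition[v]:
            inside[partition[u]] = inside.get(partition[u], 0) + 1
    return sum(inside.get(c, 0) / m - (total[c] / (2 * m)) ** 2 for c in total)


def local_moving(edges, initial, quality=modularity):
    """Greedily move single nodes to a neighbour's community while quality rises.

    initial: dict node -> community label (singleton or precomputed partition)
    Returns the final partition and its quality score.
    """
    partition = dict(initial)  # do not mutate the caller's partition
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    best = quality(edges, partition)
    improved = True
    while improved:
        improved = False
        for node in sorted(neighbours):
            home = partition[node]
            best_label = home
            candidates = {partition[n] for n in neighbours[node]} - {home}
            for label in sorted(candidates):
                partition[node] = label
                score = quality(edges, partition)
                if score > best + 1e-12:  # accept strict improvements only
                    best, best_label = score, label
            partition[node] = best_label
            if best_label != home:
                improved = True
    return partition, best
```

Passing a singleton partition (each node in its own community) or a partition produced by an earlier run as `initial` gives the two starting points compared in our experiments, here in a much simplified setting.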

Currently, examining scalability issues is an immediate concern
for future work, as we have initiated a relationship with a
major American police department to study gang violence, which
will require the examination of networks of size 10^5 nodes. In
this application domain, the identification of particularly
localized communities, as opposed to disperse communities, may
be of interest as well; a modularity definition optimizing for
this is therefore another potential item for immediate future
work. We are also working with various agencies in the U.S.
Department of Defense to transition this technology to study
networks of hundreds to thousands of nodes. With this particular
user base, our focus is on readying the technology for
deployment to analysts in a usable system.